# Quiz 4 Review

Suvinay Subramanian

6.823 Spring 2016

- » VLIW
  - Loop unrolling, software pipelining, predicated execution, speculative execution, trace scheduling
- » Vector Computers
  - Vector lanes, vector length register, masking, chaining
- » GPUs
  - Warps, branch divergence (masking)
- » Transactional Memory
  - Eager, lazy versioning
  - Optimistic, pessimistic conflict detection

- » Abstraction layer between hardware and architecture of computer
  - i.e. separates ISA from actual hardware implementation details.
- » Layer of hardware-level instructions that implement higher-level (i.e. ISA) instructions

### Microcode Implementation



Control Signals (17)

### Microcode Fragments

| State                                                        | Control points                                  | next-state            |
|--------------------------------------------------------------|-------------------------------------------------|-----------------------|
| fetch <sub>0</sub>                                           | $MA \leftarrow PC$                              | next                  |
| fetch <sub>1</sub>                                           | $IR \leftarrow Memory$                          | spin                  |
| fetch <sub>2</sub>                                           | $A \leftarrow PC$                               | next                  |
| fetch <sub>3</sub>                                           | $PC \leftarrow A + 4$                           | dispatch              |
| <br>ALU <sub>0</sub><br>ALU <sub>1</sub><br>ALU <sub>2</sub> | A ← Reg[rs]<br>B ← Reg[rt]<br>Reg[rd]←func(A,B) | next<br>next<br>fetch |
| ALUi <sub>0</sub>                                            | A ← Reg[rs]                                     | next                  |
| ALUi <sub>1</sub>                                            | B ← sExt <sub>16</sub> (Imm)                    | next                  |
| ALUi <sub>2</sub>                                            | Reg[rd]← Op(A,B)                                | fetch                 |

- » VLIW
  - Loop unrolling, software pipelining, predicated execution, speculative execution, trace scheduling
- » Vector Computers
  - Vector lanes, vector length register, masking, chaining
- » GPUs
  - Warps, branch divergence (masking)
- » Transactional Memory
  - Eager, lazy versioning
  - Optimistic, pessimistic conflict detection

#### VLIW

» Premise: Static instruction scheduling + superscalar execution to extract ILP.

- » Tradeoff: Complex hardware vs Complex compiler "Conservation of complexity"
  - VLIW machines: Compilers figure out independent instructions and schedules them suitably

## VLIW Hardware



- Multiple operations packed into one instruction
- Each operation slot is for a fixed function
- Constant operation latencies are specified

## VLIW Software

- » Key Questions:
  - How do we find independent instructions to fetch/execute?
  - How to enable more compiler optimizations?
- » Key Ideas:
  - Get rid of control flow
    - Predicated execution, loop unrolling
  - Optimize frequently executed code-paths
    - Trace scheduling
  - Others: Software pipelining, speculative execution

## Loop Unrolling

» Unroll loop to perform M iterations at once

- Get more independent instructions
- Need to be careful about case where M is not a multiple of number of loop iterations



## Loop Unrolling

|                    |          | Int1   | Int 2 | M1    | M2 | FP+     | FРX |
|--------------------|----------|--------|-------|-------|----|---------|-----|
| loop: ld f1, 0(r1) |          |        |       |       |    |         |     |
| ld f2, 8(r1)       | loop:    |        |       | ld f1 |    |         |     |
| ld f3, 16(r1)      |          |        |       | ld f2 |    |         |     |
| ld f4, 24(r1)      |          |        |       | ld f3 |    |         |     |
| add r1, 32         |          | add r1 |       | ld f4 |    | fadd f5 |     |
| fadd f5, f0, f1    | Schedule |        |       |       |    | fadd f6 |     |
| fadd f6, f0, f2    | Schedule |        |       |       |    | fadd f7 |     |
| fadd f7, f0, f3    |          |        |       |       |    | fadd f8 |     |
| fadd f8, f0, f4    |          |        |       | sd f5 | 1  |         |     |
| sd f5, 0(r2)       |          |        |       | sd f6 |    |         |     |
| sd f6, 8(r2)       |          |        |       | sd f7 |    |         |     |
| sd f7, 16(r2)      |          | add r2 | bne   | sd f8 |    |         |     |
| sd f8, 24(r2)      |          | add TE | DITO  | 5010  |    |         |     |
| add r2, 32         |          |        |       |       |    |         |     |
| bne r1, r3, loop   |          |        |       |       |    |         |     |

In+1

1-+ 0

N / - 1

140

ED.

#### You should be familiar with filling in such a table

5/10/16

EDv

## Software Pipelining



## Predicated, Speculative Execution

- » Limited ILP within a basic-block; branches limit available ILP
- » Predication: Eliminate hard-to-predict branches by converting control dependence to data dependence
  - Each instruction (within the branch basic block) has a predicate bit set
  - Only instructions with true predicates are executed and committed.
- » Speculation: Move instructions above branches to explore more ILP options

## Trace Scheduling

» Idea: For non-loop situations:

- Find common path in program trace
- Re-align basic blocks to form straight-line trace
  - Trace: Fused basic-block sequence
- Schedule trace
- Create fixup code in case trace != actual path



### Trace Scheduling



#### A, C, D superblock (trace)

- » VLIW
  - Loop unrolling, software pipelining, predicated execution, speculative execution, trace scheduling
- » Vector Computers
  - Vector lanes, vector length register, masking, chaining
- » GPUs
  - Warps, branch divergence (masking)
- » Transactional Memory
  - Eager, lazy versioning
  - Optimistic, pessimistic conflict detection

### Vector Computers

- » Single-instruction multiple data (SIMD)
  - Single instruction to operate on an entire vector (instead of scalars)
- » Vector length register (VLR)
  Vector masking (conditional execution)
  Vector chaining
  Vector lanes

If we ask code, we will provide syntax, meaning for vector instructions

### GPUs

- » Single-instruction multiple thread (SIMT)
  - Multiple instruction streams of scalar instructions
- » Warps: A set of threads grouped together, and executing the same instruction (modulo divergence)

» Branch divergence: Handled through masking

- » VLIW
  - Loop unrolling, software pipelining, predicated execution, speculative execution, trace scheduling
- » Vector Computers
  - Vector lanes, vector length register, masking, chaining
- » GPUs
  - Warps, branch divergence (masking)
- » Transactional Memory
  - Eager, lazy versioning
  - Optimistic, pessimistic conflict detection

### Transactional Memory

- » Idea: No locks, only shared data Idea: Optimistic (speculative) concurrency
  - Execute critical section speculatively
  - Abort on conflicts
- » Key properties:
  - Atomicity (all or nothing)
  - Isolation (no other code can observe updates before commit)
  - Serializability

### Transactional Memory Taxonomy

- » Data Versioning
  - Eager
  - Lazy
- » Conflict Detection
  - Pessimistic
  - Optimistic

## Data Management Policy

How to manage the "tentative work" that a transaction does

- 1. Eager versioning (undo-log based)
  - Update memory location directly
  - Maintain undo info in a log
  - Fast commits
  - Slow aborts
- 2. Lazy versioning (write-buffer based)
  - Buffer data until commit in a write buffer
  - Update actual memory locations at commit
  - Fast aborts
  - Slow commits

## Conflict Detection Policy

How to ensure isolation between transactions

- Pessimistic detection Check for conflicts during loads or stores
- Optimistic detection Detect conflicts when a transaction attempts to commit

- » VLIW
  - Loop unrolling, software pipelining, predicated execution, speculative execution, trace scheduling
- » Vector Computers
  - Vector lanes, vector length register, masking, chaining
- » GPUs
  - Warps, branch divergence (masking)
- » Transactional Memory
  - Eager, lazy versioning
  - Optimistic, pessimistic conflict detection

Thank You! All the best ③